ACOUSTIC CUES IN NATURAL SPEECH: THEIR NATURE AND POTENTIAL USES IN SPEECH
RECOGNITION


Introduction 


This report summarizes the work carried out by Haskins Laboratories
between 1973 and 1976 on a research contract to study those acoustic
cues in natural speech that are of potential use for speech-recognition
purposes.

Objectives  - The  research  carried  out  under  this  program  had  two  principal 
objectives. 

(a) To carry out basic research on automatically identifying the
acoustic cues of natural speech. This research was intended to
bear directly on the scientific problems encountered in building
a Speech Understanding System.

(b) To work closely with the principal contractors in the program
whose tasks were to build complete speech-understanding systems.
It was expected that this close cooperation would allow the
research results to be quickly incorporated in these systems and
their usefulness to be readily evaluated.

Summary  of  the  Program’s  Scope  and  Accomplishments 

A practical speech understanding system must be capable of converting
the spoken message into a linguistic representation from which, after
consulting the appropriate base of stored information, a useful response
can be formulated. The basic structure of the spoken message is composed
of phonetic elements, and the very first step toward deriving a linguistic
representation must be to identify the phonetic message. However, accurate
automatic phonetic identification solely on the basis of acoustic data
cannot be achieved by present techniques and, moreover, there is evidence
that even human listeners experience difficulties under similar conditions.
During most speech exchanges, higher-order information of a lexical,
syntactic, semantic and pragmatic nature plays a significant role in human
listeners' ability to interpret the phonetic content of speech. The primary
factor that distinguishes speech understanding systems from automatic
isolated-word recognizers is that they generally include algorithms (called
components) that generate and evaluate hypotheses at a variety of levels.
These levels can often be identified as being lexical, syntactic, semantic
and pragmatic as well as acoustic-phonetic.

The acoustic-phonetic component must convert the acoustic signal into
a series of hypothesized phonetic strings, each with assigned likelihoods,
that represent the results of a local acoustic analysis of the utterance.
The system's performance at the acoustic level may be augmented by an
acoustic verification component whose responsibility it is to evaluate the
acoustic evidence in favor of specific word or syllable hypotheses that are
formulated at higher levels in the system hierarchy. Similar procedures of
analysis and verification are often applied at these higher levels, using
evidence that must, in the first instance, be extracted by the acoustic
analyzer but is sifted and evaluated in the light of the system's current
knowledge of the topic being discussed. However, the work to be reported
here was concerned exclusively with developing improved procedures for
primary acoustic analysis and verification that could serve systems capable
of performing higher levels of analysis.

Our research began with an exploration of the difficulties encountered
by humans in analyzing the patterns of connected speech presented in the
nonacoustic medium of a spectrogram. The task involved the analysis of an
unknown sentence into words which were matched against those selected from
a lexicon of spectrograms. The work was conducted in the absence of any
semantic or syntactic knowledge and therefore engaged the analyst solely at
the level of acoustic features. Experience gained in this experiment
underlined the difficulties of extracting reliable acoustic cues,
particularly from the unstressed parts of sentences.

Following such early studies, a strategy was developed for organizing
the various stages of acoustic analysis that would be responsible for
segmenting and labeling sections of the acoustic signal in an ordered
hierarchy. The sequential arrangement featured the more reliable analyses
first so that if errors were encountered in later analyses, only a limited
amount of backtracking would be necessary. Furthermore, since the acoustic
cues in stressed syllables are generally more sharply defined than in
unstressed words, and boundaries cannot be directly identified in the speech
signal, an algorithm was developed to segment the signal into syllable-sized
units. The output units of this segmentation algorithm were subsequently
used in an exploratory study of methods for the automatic detection of
relative prominence of syllable sequences.

Algorithms were also developed for several steps involved in the detailed
analysis of syllabic units into constituent segments. In particular, our
work included a detailed study aimed at detecting likely transition points
to or from nasal consonants and characterizing the transition regions as
manifestations of the onset or termination of nasality. For the future,
however, it is apparent that more detailed studies of other acoustic
features are required to guide the construction of decision rules for
extracting other phonetic segments.

In additional work, we anticipated the need to retrieve syllables from a
lexicon with the aid of only an incomplete phonetic representation. Our
pilot study explored an acoustic-verification technique that compared
various distance measures for their effectiveness in distinguishing
monosyllabic words spoken by four speakers despite interspeaker differences.

The Laboratories' syllabic segmentation algorithm was supplied to the
Systems Development Corporation and utilized in their speech-understanding
system. However, the strategy that we proposed for phonetic analysis was
not evaluated within any of the major contractors' systems, primarily
because of manpower limitations. Our strategy was based in part on the
already available results of human speech perception studies and was
intended to be generalizable to very large vocabularies without significant
difficulties. This latter objective differed from that pursued by the major
systems builders, who based their systems on vocabularies of very limited
size and, unfortunately, this prevented our phonetic analysis strategy from
being fully implemented by any of the systems builders. Moreover, since we
suffered from a shortage of manpower, we were also prevented from
implementing the complete analysis ourselves before the program expired.
Nevertheless, several individual components of our syllable-based speech
analysis were evaluated and yielded results comparable to or significantly
better than those attained by previously used methods.

In addition to the work on acoustic analysis, the Laboratories also
carried out development work on two major research facilities: the Digital
Pattern Playback (a tool for analyzing and resynthesizing speech signals)
and a software interface to the ARPA network designed for the PDP-11/45
computer and RSX-11D operating system.

To sum up, the primary focus of our work was to identify problems at
the acoustic level that automatic speech recognition systems must overcome
and to find promising methods for their solution that could be used by the
major systems builders. Although automatic speech recognition for large
vocabularies is still an unsolved problem, we can look back on the work
completed under this program with the firm conviction that we now have a
better understanding of the basic problems and have made some important
advances toward their eventual solution.

This report includes contributions by the following members of the
research and technical staff of Haskins Laboratories: F. S. Cooper,
Principal Investigator, and Staff Members P. W. Nye, P. Mermelstein,
J. Gaitenby, G. M. Kuhn, R. M. McGuire, L. Reiss and T. Montlick. A perusal
of the individual research reports may not give the reader an adequate view
of our overall effort. Hence, to overcome this shortcoming in the record, we
discuss briefly in what follows the individual research efforts carried out
under this grant, attempt to integrate them into the overall goals and
review the conclusions drawn. In each case a detailed report on the work has
been written and is attached or cited for the information of readers
interested in the experimental details.

RESEARCH STUDIES PERFORMED AT HASKINS LABORATORIES


1.  Speech  Recognition  through  Spectrogram  Matching 

Human listeners process and identify speech signals with an ease
(born of much experience) that belies the complexity of the task. The
true nature of the complexities becomes apparent quite quickly, though, when
the listener is denied the use of his ears and is obliged to examine some
nonacoustic representation of the speech signal to uncover the message. Not
only do the problems come more clearly into focus, but alternative
strategies for solving the problems can be explored with a view to
subsequently applying these strategies in computer algorithms.

Two separate studies involving the use of spectrograms as the
nonacoustic medium were carried out under the contract. The first (Ingemann
and Mermelstein, 1975) employed conventional paper spectrograms of a
sentence several seconds in length and a lexicon of about 100 reference
words. Experience soon showed that the paper-shuffling task involved
in organizing a large number of reference spectrograms could easily become
unmanageable. A second, computer-assisted study (Nye, Cooper and
Mermelstein, 1975) was then carried out. Both studies aimed to:

(a) Assess the performance of humans in matching spectrograms of words in
sentences with spectrograms of the same words when spoken in a reference
context.

(b) Study any improvements in analysis that take place as a result of
supplying the human analyzer with feedback spectrograms (i.e., spectrograms
of spoken versions of the analyzer's chosen word-sequence representing his
hypotheses for the constituents of the unknown sentence frame). In speech
recognition systems, such verifying feedback data could be generated by
existing speech synthesis techniques.

The  study  led  to  three  principal  conclusions: 

(i) Even when the number of words correctly matched is low, the
number of syllables in the hypothesized word-sequence generally
agrees with that in the unknown sentence.

(ii) The recognition of monosyllabic words is made significantly more
difficult by the presence of a greater number of phonetically
similar words in the reference vocabulary. Furthermore, the
number of correctly matched phonemes is always significantly larger
than the number of syllables (or words) because the errors are
generally substitutions of phonetically similar words. These
are important facts to be aware of because an ability to
discriminate among many phonetically similar words is an essential
requirement for large-vocabulary recognition systems.

(iii) The study of the potential benefits of feedback of an hypothesized
sentence-frame was inconclusive. It appeared that feedback
can be useful only if performance is already relatively high
without feedback. When a significant number of errors is
present in the hypothesized word sequence and the variation in
context makes only a minor contribution to these errors,
performance is not improved.

2. Spectrogram Reading of Vowel-Consonant-Vowel Sequences

The studies of word matching in sentence contexts revealed that observers
often have difficulty in accurately segmenting the speech signal into
vocalic and consonantal segments. However, even after the vowel-consonant
segmental pattern is determined, additional problems remain, such as those
involved in selecting the appropriate consonant when no additional lexical
information is available. Such problems are, of course, latent in the
automatic recognition of continuous speech and it was therefore appropriate
to conduct an exploratory study (Kuhn and McGuire, 1974). The aims of the
study were to:

(a) Study the confusions among consonants that are encountered by
experienced acoustic-phoneticians and then examine the differences in the
acoustic cues for these consonants with the aid of spectrograms.

(b) Observe the effects of concentrated learning with feedback on improving
the spectrographic identifiability of tokens having the same general
phonological context.

The  conclusions  of  our  vowel-consonant-vowel  (VCV)  study  were  that: 

(i)  Place  of  production  errors  are  the  most  frequently  encountered 
among  consonant  errors  in  a VCV  environment.  It  is  known  that 
the  spectral  positions  of  place  of  production  cues  are  shifted 
significantly  depending  on  the  vowel  environment.  Manner  and 
voicing  errors  occur  much  less  frequently. 

(ii) In the course of learning sessions, overall identification of
consonants improved significantly from 75 to 90 percent. Stops
and fricatives showed the largest improvement. Identification
of nasals and semivowels proved to be more resistant to learning.

(iii) Even after concentrated learning on cues exhibited in similar
contexts, a significant number of errors remained. A conventional
spectrographic representation of the acoustic cues may
not be adequate for perfect recognition. One may have to use
additional cues not easily seen in spectrographic presentations.

3. Automatic Segmentation into Syllabic Units

Segmentation of the continuous speech signal into articulatorily, or
phonologically, relevant units must be one of the first steps in any
analysis procedure. Hence, following the spectrogram-reading experiments, an
attempt at finding a satisfactory way of segmenting the speech signal became
a matter of first priority. The research study led to the selection of the
syllable as the most promising speech unit, having not only a linguistic
identity but a relatively stable physical manifestation as well. An
additional factor in favor of the syllable was that the interaction between
adjacent syllabic units is much less than that observed between their
constituent phonetic segments. Moreover, it appeared likely that more
detailed analyses could subsequently concentrate on the internal segmental
structure of the syllables. Finally, the segmentation of the speech
signal into words could be achieved by mapping syllable-sized units into an
appropriately structured syllable-based lexicon.

These were the reasons that led us to explore the development of an
algorithm that would automatically segment continuous speech signals into
syllable-sized units (Mermelstein, 1975b).

The  conclusions  drawn  from  our  study  were  that: 

(i) Syllabic units can be isolated in continuous speech by simple
automatic means. The algorithm was tested on 400 syllables of
continuous speech; it missed only 6.9 percent of the syllables
and inserted only 2.6 percent additional syllables relative
to a nominal, slow-speech syllable count.

(ii) The syllabic boundaries that are chosen by the algorithm,
however, do not generally correspond to boundaries assigned on the
basis of phonological criteria.

4.  A Strategy  for  Acoustic  Analysis 

Having developed an algorithm that would segment the speech signal into
syllable-sized units, the next step became the development of a strategy for
analyzing these units in an effort to identify the constituent phonetic
segments. The study we undertook had several basic aims, the rationale for
which Mermelstein (1975a) described in his published report. These
objectives were that:

(a) Having identified a few segments, advantage should be taken of the
phonological constraints on adjacent segments, thus eliminating the
necessity to consider every phone as an hypothesis for each identified
segment.

(b)  Since  segmentation  and  labeling  generally  require  similar  analysis 
operations,  our  strategy  should  be  to  combine  the  two  procedures  so  that  the 
signal  is  segmented  only  at  points  where  a labeling  difference  is  found  between 
the  adjacent  segments. 

(c) The hypothesized segmental constituents of syllabic units should be used
to retrieve similarly represented items from a reference syllabary. Thus all
words that share a stressed syllable may be readily retrieved on the basis
of acoustic information from the stressed syllable alone. The hypotheses may
be more or less specific. Since the general hypotheses subsume more specific
ones, too many reference forms may be retrieved in response to a particular
hypothesis. In such cases, additional analyses would be performed to reduce
the number of comparisons to reference items.
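
Objective (c) can be illustrated with a minimal sketch of retrieval from a
reference syllabary using hypotheses of differing specificity. The class
labels, phone symbols and syllabary entries below are hypothetical and are
not taken from the report; the sketch only shows how a general hypothesis
subsumes a more specific one and therefore retrieves more reference forms.

```python
# Hypothetical broad classes; each general label subsumes specific phones.
BROAD_CLASSES = {
    "STOP": {"p", "t", "k", "b", "d", "g"},
    "NASAL": {"m", "n"},
    "VOWEL": {"a", "i", "u"},
}


def matches(label, phone):
    """A label matches a phone if it names that phone or a class containing it."""
    return label == phone or phone in BROAD_CLASSES.get(label, ())


def retrieve(hypothesis, syllabary):
    """Return every reference syllable consistent with the hypothesis."""
    return [
        name
        for name, phones in syllabary.items()
        if len(phones) == len(hypothesis)
        and all(matches(h, p) for h, p in zip(hypothesis, phones))
    ]


# Toy reference syllabary (illustrative entries only).
SYLLABARY = {
    "pat": ["p", "a", "t"],
    "bat": ["b", "a", "t"],
    "man": ["m", "a", "n"],
    "pin": ["p", "i", "n"],
}

# A general hypothesis retrieves several forms; a more specific one narrows it.
general = retrieve(["STOP", "VOWEL", "STOP"], SYLLABARY)
specific = retrieve(["p", "VOWEL", "STOP"], SYLLABARY)
```

In this toy case the general hypothesis returns two candidates and the
specific one returns a single candidate, which is the situation in which
additional analyses would be used to cut down the number of comparisons.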

Our  major  conclusions  from  this  study  were  as  follows: 

(i) A hierarchical organization is an effective device to exploit
our current knowledge about the phonological constraints within
syllabic segments. Its special advantage is that it allows easy
integration of independent decision modules into the overall
system.

(ii) Originally, we considered implementing the decision structure
in a deterministic manner. However, since small finite error
probabilities are associated even with early decisions, a
probabilistic structure appears to be more appropriate. Decisions
at every node of the decision tree have a priori and a posteriori
probability assignments. The best hypothesis is associated with
the highest a posteriori probability. If the first hypothesis
breaks down, branches with lower probability can be followed.

(iii) The speech signal is composed of dynamic segments. Perceptual
categorization of these segments is based on numerous acoustic
parameters whose detailed contributions to an individual decision
are not yet known. Accurate phonemic categorization of segments
requires a better understanding of the perceptual roles of the
various acoustic parameters. This appears to be the strongest
limitation to our acoustic-phonetic decoding capability today.
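
The probabilistic decision structure of conclusion (ii) can be sketched as
a best-first traversal of a decision tree, in which the most probable
hypothesis is delivered first and lower-probability branches remain
available if it breaks down. This is only an illustration under stated
assumptions: the tree, the node names and the branch probabilities are
invented, and the branch values stand in for the a posteriori
probabilities, whose computation the report does not specify.

```python
import heapq


def ranked_hypotheses(tree, root):
    """Yield (probability, path) for leaves in order of decreasing probability.

    `tree` maps a node name to a list of (child, branch probability) pairs;
    nodes absent from `tree` are leaves. A path's probability is the product
    of the branch probabilities along it.
    """
    # heapq is a min-heap, so probabilities are stored negated.
    heap = [(-1.0, [root])]
    while heap:
        neg_p, path = heapq.heappop(heap)
        children = tree.get(path[-1])
        if not children:
            yield -neg_p, path
            continue
        for child, p in children:
            heapq.heappush(heap, (neg_p * p, path + [child]))


# Hypothetical decision tree with assumed branch probabilities.
TREE = {
    "segment": [("sonorant", 0.7), ("obstruent", 0.3)],
    "sonorant": [("nasal", 0.6), ("vowel", 0.4)],
    "obstruent": [("stop", 0.8), ("fricative", 0.2)],
}

ranking = list(ranked_hypotheses(TREE, "segment"))
best_probability, best_path = ranking[0]
```

Because the generator is lazy, a caller can take the first hypothesis and
request the next one only when verification of the first fails, which keeps
backtracking limited to the branches actually needed.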

5.  Detecting  Nasals  in  Continuous  Speech 

The building of a detection component for nasal consonants was undertaken
as a step toward implementing our analysis strategy for extracting
segmental information from the syllabic units. The work also set out with
the aim of taking advantage of the information latent in the phonetic
context of segments (Mermelstein, 1975c).

Hypotheses concerning the possible existence of nasals were formulated
according to the context-dependent strategy outlined in Section 4. We first
identified the spectral transition points that could mark the onset or
termination of nasal-murmur segments. Then, using the segmentation of the
speech stream into syllable-sized units, we took advantage of the
phonological constraint that a syllable has, at most, two nasal-murmur
segments: one prior to the syllabic vowel and one between the vowel and the
end of the syllable. Additionally, we investigated to what extent knowledge
of the direction of the transition, into or out of the nasal, was useful in
attaining improved recognition of the existence of nasal segments.

This  study  led  to  the  following  conclusions: 
(i) The transition to and from nasal segments can be effectively
characterized by the time-varying characteristics of four
simple acoustic measures: the relative energy change in the
frequency bands 0-1, 1-2 and 2-5 kHz and the frequency centroid
of the 0-500 Hz band. Using multivariate statistics on four
samples of these measures spaced 12.8 msec apart, a 91 percent
correct nasal/nonnasal decision rate can be attained for data
of two speakers. This categorization rate is significantly
better than could be attained using stationary statistics on
the same or similar measures.

(ii)  Careful  selection  of  the  point  of  maximum  spectral  change  is 
critical  to  the  success  of  the  procedure.  The  usefulness  of 
these  measurements  toward  the  separation  of  nasal  and  nonnasal 
categories  drops  rapidly  as  one  moves  away  from  the  point  of 
maximal  spectral  change. 
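
The four measures named in conclusion (i) can be sketched as follows. The
sketch rests on several assumptions that are not stated in the report: a
10 kHz sampling rate (so that a 128-sample frame spans 12.8 msec), a naive
DFT as the spectral analysis, and "relative energy change" interpreted as
band energy in decibels relative to the first of the four frames. The
multivariate nasal/nonnasal decision itself is not shown.

```python
import cmath
import math

SAMPLE_RATE = 10_000  # Hz; assumed, not stated in the report
FRAME_LEN = 128       # samples; 128 / 10 kHz = 12.8 msec


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0 .. N/2 (adequate for short frames)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def frame_measures(frame, sample_rate=SAMPLE_RATE):
    """Return (E 0-1 kHz, E 1-2 kHz, E 2-5 kHz in dB, centroid of 0-500 Hz in Hz)."""
    spec = power_spectrum(frame)
    hz_per_bin = sample_rate / len(frame)

    def band(lo, hi):
        return [(k * hz_per_bin, p) for k, p in enumerate(spec)
                if lo <= k * hz_per_bin < hi]

    def log_energy(lo, hi):
        # Small constant avoids log(0) for empty or silent bands.
        return 10 * math.log10(sum(p for _, p in band(lo, hi)) + 1e-12)

    low = band(0, 500)
    total = sum(p for _, p in low) + 1e-12
    centroid = sum(f * p for f, p in low) / total
    return (log_energy(0, 1000), log_energy(1000, 2000),
            log_energy(2000, 5000), centroid)


def transition_vector(signal, start):
    """Stack the measures of four consecutive frames into one 16-element vector.

    Band energies are expressed relative to the first frame (an assumed
    reading of "relative energy change"); the centroid is kept in Hz.
    """
    rows = [
        frame_measures(signal[start + i * FRAME_LEN: start + (i + 1) * FRAME_LEN])
        for i in range(4)
    ]
    base = rows[0]
    vec = []
    for row in rows:
        vec.extend([row[0] - base[0], row[1] - base[1], row[2] - base[2], row[3]])
    return vec


# A stationary tone centred on DFT bin 3 (234.375 Hz) as a sanity check.
signal = [math.sin(2 * math.pi * 3 * t / FRAME_LEN) for t in range(4 * FRAME_LEN)]
vec = transition_vector(signal, 0)
```

For the stationary test tone the relative band energies stay at zero and
the low-band centroid sits at the tone frequency; a transition into or out
of a nasal murmur would instead show up as a change across the four frames.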

6.  Distance  Measures  for  Speech  Recognition 

The most useful metric to represent the acoustic similarity of unknown
syllables is one that can also predict the perceptual similarity of those
syllables. The aim of this exploratory study (Mermelstein, 1976b) was to
review available data from speech perception and speech transmission studies
on the confusability of speech sounds and to express this confusability as a
multidimensional distance in phonetic space. Some desirable properties that
distance measures should possess for the accurate verification of syllable
hypotheses were identified in the light of the perceptual data.

Our experimental study explored the use of a two-dimensional mel-based
cepstral measure of the distance between many unknown syllables
spoken by various speakers. Syllable templates obtained by combining
information from all available productions of a given monosyllabic word
could be used to maximize the similarity between an unknown token of a word
and its stored template. Additionally, the significance of local cepstral
differences was assessed with the aid of estimates of the variability of
those measures over the set of known productions of that word.

The study concluded that:

(i) An ability to weigh observed differences according to their
significance was necessary for successful verification.

(ii)  Speech  synthesis  techniques  can  yield  representative  tokens  of 
the  hypothesized  syllables  which  are  acceptable  to  a listener 
but  they  do  not  provide  information  concerning  the  degree  to 
which  variations  from  the  tokens  may  occur.  Therefore  stored 
templates  augmented  by  variability  information  offer  a better 
short-term  solution  to  the  verification  problem. 
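
A stored template augmented by variability information can be sketched as
a per-coefficient mean and variance, with observed differences weighted by
the inverse of the variance. The flat feature vectors, the variance floor
and the numerical values below are assumptions made for illustration; the
report's measure operates on mel-based cepstral data whose exact layout is
not reproduced here.

```python
import math


def build_template(productions):
    """Per-coefficient mean and variance over the known productions of a word."""
    n = len(productions)
    dims = len(productions[0])
    mean = [sum(p[d] for p in productions) / n for d in range(dims)]
    var = [sum((p[d] - mean[d]) ** 2 for p in productions) / n
           for d in range(dims)]
    return mean, var


def weighted_distance(token, template, floor=1e-6):
    """Distance in which each difference is scaled by the template variability."""
    mean, var = template
    return math.sqrt(sum((t - m) ** 2 / max(v, floor)
                         for t, m, v in zip(token, mean, var)))


def euclidean_distance(token, template):
    """Unweighted distance to the template mean, for comparison."""
    mean, _ = template
    return math.sqrt(sum((t - m) ** 2 for t, m in zip(token, mean)))


# Hypothetical productions: coefficient 0 varies widely, coefficient 1 is stable.
template = build_template([[0.0, 1.0], [4.0, 1.2], [2.0, 0.8]])

token_a = [4.0, 1.0]  # differs only in the highly variable coefficient
token_b = [2.0, 2.0]  # differs only in the stable coefficient
```

Here the unweighted distance treats token_b as the closer match, whereas
the variance-weighted distance treats token_a as closer, since its
deviation falls in a coefficient that is known to vary across productions.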
7. Acoustic Determinants of Stop-Consonant Place Perception

Identification of the place-of-production feature of stop consonants
is a difficult task for any speech recognition system. A study reported
by Kuhn (1975) focused on a theory for perceptual place assessment based
on distinguishing the front-cavity resonance (i.e., the resonant frequency
of the cavity of the vocal tract lying immediately anterior to the major
constriction that produces the consonant). The study showed that, if one
assumes that the front-cavity resonance can be detected by human perceptual
mechanisms, one can explain certain perceptual behaviors with synthetic
speech stimuli that would otherwise appear anomalous if attempts were to be
made to explain them solely on the basis of acoustic considerations.
Additional data on perceptual measurements are included in a Ph.D. thesis
entitled "An Experimental Study of the Acoustic Determinants of Stop
Consonant Place Perception: Observations from the Synthesis of Single
Formant Stimuli" to be submitted to the Department of Linguistics at the
University of Connecticut by G. M. Kuhn. This research study attempted to
assess the role of resonances of the front cavity of the vocal tract in the
perception of intervocalic stop consonants.

The relevance of this study to the automatic analysis of acoustic speech
data was based on three hypotheses:

(a) A front-cavity resonance frequency estimate can be obtained
automatically from the information in the speech signal.

(b) The cues for place of articulation of consonants can be described
concisely from an articulatory viewpoint and the front-cavity resonance
serves as an aid to the listener in assessing the place of articulation.

(c) A front-cavity resonance frequency estimate may serve as a
speaker-independent articulatory reference. Front-cavity lengths may be more
similar across speakers than the lengths of their entire vocal tracts.

Kuhn  concluded  that: 

(i) A front-cavity frequency estimate can be made from speech data
by weighting the spectra by the middle and inner ear transfer
functions, converting the frequency scale to mels, smoothing
with filters having equal bandwidths in mels, and selecting
the most prominent spectral peak.

(ii) The relative contribution of the second or third formant
frequency to stop-consonant place perception changes as the
front-cavity affiliation of those formants changes. The formant that
contributes most to correct place perception appears to be the one
that is most closely associated with the front cavity.
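
The four steps listed in conclusion (i) can be sketched as below. Several
elements are assumptions rather than details from the report: the mel
conversion formula is one common approximation, the ear transfer function
is left as a caller-supplied weight (flat by default), the smoothing filter
is a simple rectangular average over a fixed mel bandwidth, and the "most
prominent peak" is taken to be the maximum of the smoothed spectrum.

```python
import math


def hz_to_mel(f):
    """A common mel approximation (assumed; the report does not give a formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def front_cavity_estimate(freqs_hz, power, ear_weight=lambda f: 1.0,
                          bandwidth_mel=150.0):
    """Return the frequency (Hz) of the largest peak of the smoothed spectrum."""
    # Step 1: weight the spectrum by an ear transfer function (placeholder).
    weighted = [p * ear_weight(f) for f, p in zip(freqs_hz, power)]
    # Step 2: convert the frequency scale to mels.
    mels = [hz_to_mel(f) for f in freqs_hz]
    # Step 3: smooth with a rectangular window of equal width in mels.
    half = bandwidth_mel / 2.0
    smoothed = []
    for m in mels:
        window = [w for mm, w in zip(mels, weighted) if abs(mm - m) <= half]
        smoothed.append(sum(window) / len(window))
    # Step 4: select the most prominent peak.
    best = max(range(len(smoothed)), key=smoothed.__getitem__)
    return freqs_hz[best]


# Synthetic spectrum: a broad resonance at 1800 Hz plus a narrow spike at 3000 Hz.
freqs = list(range(0, 5000, 50))
spectrum = [math.exp(-((f - 1800) / 300.0) ** 2) + (3.0 if f == 3000 else 0.0)
            for f in freqs]

estimate = front_cavity_estimate(freqs, spectrum)
raw_peak = freqs[max(range(len(spectrum)), key=spectrum.__getitem__)]
```

In this synthetic case the unsmoothed maximum lies on the narrow spike,
while the smoothed estimate follows the broad resonance, which illustrates
why the smoothing step precedes peak selection.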

8. Acoustic Determinants of Perceived Prominence

This  study  (Gaitenby,  1976)  sought  to  find  acoustic  indicators  of 
the  most  prominent  syllables  within  an  unknown  utterance.  Since  the  most 
prominent  syllables  generally  carry  more  detailed  acoustic  information, 
acoustic-based  hypotheses  concerning  the  phonetic  makeup  of  such  syllables 
are  more  likely  to  be  correct  than  are  the  hypotheses  for  the  less  prominent 
syllables.  A vocabulary  organized  around  certain  syllables  that  are  likely 
to  be  prominent  can  then  be  used  to  retrieve  hypotheses  concerning  the  less 
prominent  syllables.  The  study  computed  a weighted  intensity  function  for 
the speech signal and compared a measure of syllable prominence derived
automatically from that function with listeners' perceptual judgments of the
relative  prominence  of  the  syllables.  Weighted  intensity,  a measure  roughly 
representing  the  relative  loudness  of  speech  as  a function  of  time,  was  used 
previously  for  segmentation  of  the  signal  into  syllable-sized  units. 

Our conclusions were that weighted intensity is also a reliable
indicator of relative prominence among syllables. Predictions based on that
measure  were  in  substantial  agreement  with  perceptual  judgments  of  the  same 
speech  material. 
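
A prominence ranking of this kind might be sketched as follows, assuming
that a weighted intensity contour and syllable boundaries are already
available. The use of the peak value within each syllable as the
prominence measure is an assumption; the report does not state how the
measure was derived from the intensity function, and the numbers are
invented.

```python
def syllable_prominence(intensity, boundaries):
    """Peak weighted intensity within each syllable-sized unit."""
    return [max(intensity[a:b]) for a, b in zip(boundaries, boundaries[1:])]


def rank_by_prominence(intensity, boundaries):
    """Syllable indices ordered from most to least prominent."""
    peaks = syllable_prominence(intensity, boundaries)
    return sorted(range(len(peaks)), key=lambda i: peaks[i], reverse=True)


# Hypothetical contour (one value per frame) and boundaries (frame indices).
contour = [1, 3, 2, 1, 6, 8, 5, 2, 4, 3, 1]
bounds = [0, 4, 8, 11]

peaks = syllable_prominence(contour, bounds)
order = rank_by_prominence(contour, bounds)
```

The resulting order could then be compared with listeners' judgments, or
used to decide which syllables to analyze first.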

9. A Digital Pattern Playback for the Analysis and Manipulation of Speech
Signals

The Digital Pattern Playback (Nye, et al., 1975) is a new computer-based
research tool for the analysis, manipulation and resynthesis of speech data.
Its original design was supported by a grant from the National Science
Foundation and its construction was completed under this contract. The
Digital Pattern Playback, which is similar to an earlier analog pattern
playback, permits the generation of artificial speech sounds containing
variants of the features being studied. An important addition is the ability
to display gray-scale digital spectrograms practically instantaneously after
the utterance is spoken. Through a connection to a general-purpose computer,
previously recorded utterances can be retrieved and compared to newly
recorded speech. Furthermore, by focusing on the perceptual differences
resulting from small spectrographically defined changes in the acoustic
signal, the perceptual effects of acoustic features can be rapidly
evaluated.

This instrument has already received extensive use for the analysis of
acoustic features of utterances such as voice-onset time, nasal resonances,
segmental durations and formant frequency variations. It has also been
extensively used in simulating automatic feature assignment to unknown
utterances prior to the implementation of an automatic extraction routine
that isolated those features in an exploratory speech-recognition system
(see Section 1).

10. An ARPANET Software Interface for DEC PDP-11/45, RSX-11D

An interface package of four programs (tasks) has been designed to
operate on the Digital Equipment Corporation PDP-11/45 computer under the
RSX-11D operating system. The purpose of the undertaking was to make
possible the exchange of data files and messages relating to the contract
work between the Laboratories and other ARPA contractors engaged in the
speech understanding program. On completion, the interface package was
transmitted to about 12 other PDP-11/45 users connected to the ARPA net who
had expressed the need for Network/RSX compatibility.

The software package uses a Very Distant Host (VDH) hardware interface
manufactured by A Consultant. The package consists of a device handler
called VD (for the VDH interface), which contains the logic for the Reliable
Transmission Packet protocol. A second, pseudo device handler called NT
implements the IMP-HOST and HOST-HOST protocols. The third component is a
task called TELENET that allows the user to connect his terminal to any
server HOST on the network through NT. Finally, FTP is a user task that
implements the network standard File Transfer Protocol. This routine allows
the user to connect to a remote HOST, perform manipulations upon its file
system and transfer files between the remote system and the local RSX-11D
file system. At the present time only ASCII files may be transferred in this
manner.

Details of the package structure and operating characteristics are
described in an operator's manual intended for use in the Laboratories
(McGuire, 1975).